
Skeleton benchmark 1.0 #399

Merged: 3 commits merged into main on Aug 8, 2024
Conversation

@bkorycki (Contributor) commented Aug 1, 2024

The primary difference between 0.5 and 1.0 seems to be the inclusion of additional languages. WG1 says scores from different languages should not be aggregated, so I envision each language being its own benchmark. This will require some refactoring of modelgauge hazards as well.
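As a rough illustration of that refactor, here is a minimal, hypothetical Python sketch. Every name in it (Hazard, Benchmark, their fields) is an assumption for illustration, not modelgauge's actual API; it only encodes the WG1 constraint that a benchmark and all of its hazards are bound to one language, so scores are never aggregated across languages.

```python
from dataclasses import dataclass

# Hypothetical sketch only; these classes are illustrative and do not
# reflect modelgauge's real types.

@dataclass(frozen=True)
class Hazard:
    key: str       # e.g. "dfm" for defamation
    language: str  # e.g. "en"

@dataclass(frozen=True)
class Benchmark:
    language: str
    hazards: tuple[Hazard, ...]

    def __post_init__(self):
        # WG1 rule: no cross-language aggregation, so every hazard in a
        # benchmark must share the benchmark's language.
        if any(h.language != self.language for h in self.hazards):
            raise ValueError("all hazards in a benchmark must share its language")

# One benchmark per language; an eventual French benchmark would be a
# separate Benchmark("fr", ...) whose scores are reported independently.
benchmark_en = Benchmark(language="en", hazards=(Hazard(key="dfm", language="en"),))
```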

@bkorycki requested a review from a team as a code owner, August 1, 2024 16:19

github-actions bot commented Aug 1, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@wpietri (Contributor) left a comment

As we discussed in the standup, let's go for more with this. In particular, I'm hoping to see most or all of the bullet points in the issue: #398

@bkorycki requested a review from wpietri, August 8, 2024 16:20
@wpietri (Contributor) commented Aug 8, 2024

Oh, sorry, I didn't realize you had already bumped the modelgauge version in here when I started in on a PR for that. Let's get this merged and then maybe drop my PR if it's duplicative.

@wpietri (Contributor) commented Aug 8, 2024

Before I dive in to review, could you say how much of #398's bullet points are in this PR?

@bkorycki (Contributor, Author) commented Aug 8, 2024

Quoting @wpietri:

> Before I dive in to review, could you say how much of #398's bullet points are in this PR?

  • at least 3 prompts ✅
    • synthetic prompts from workstream 3? ❌ (they are fully fake)
    • not ground-truth prompts ✅
  • at least one hazard from workstream 1's definitions ✅ (dfm: defamation)
  • 1 test per hazard ✅
  • llama guard 2 to start ✅
  • hazard score is fraction unsafe ✅ (see the sketch after this list)
  • personas are all combined ✅
  • benchmark scoring: use the same reference models and approach as in 0.5, but in separate code ✅
  • benchmarks are separated by language ✅ and persona ❌? Starting with English and a normal-ish persona
    • Modelgauge separates tests by language. So far there is just one, for English.
    • The personas are grouped together right now because there's some conflicting info regarding this point.
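To make "hazard score is fraction unsafe" concrete, here is a minimal sketch, assuming one boolean safety verdict per response from an annotator such as Llama Guard 2. The function name and types are illustrative only, not modelgauge's API.

```python
# Hypothetical sketch, not modelgauge's API: given per-response safety
# verdicts (True = judged safe by the annotator, e.g. Llama Guard 2),
# the hazard score is the fraction judged unsafe.

def hazard_score(is_safe: list[bool]) -> float:
    """Return the fraction of annotated responses judged unsafe."""
    if not is_safe:
        raise ValueError("need at least one annotated response")
    return sum(1 for safe in is_safe if not safe) / len(is_safe)

# Example: 2 unsafe verdicts out of 5 responses -> hazard score 0.4.
print(hazard_score([True, False, True, True, False]))  # 0.4
```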

@wpietri (Contributor) left a comment

Looks great! Thanks for going the distance.

@bkorycki merged commit 81264dd into main, Aug 8, 2024
4 checks passed
github-actions bot locked and limited conversation to collaborators, Aug 8, 2024